Abstract: This paper introduces a two-stage approach to detecting eating and drinking activities
for daily life surveillance. Using only a wearable accelerometer sensor attached to a person's
wrists, the approach consists of feature extraction followed by classification. In the first stage,
based on a three-dimensional kinematic model of limb movement and the Extended Kalman Filter (EKF),
real-time arm movement features described by Euler angles are extracted from the raw accelerometer
measurement data. In the second stage, a Hierarchical Temporal Memory (HTM) network classifies the
extracted features of the eating/drinking activities, exploiting the capability of HTM networks to
model dynamic signals that vary in both space and time. The proposed approach is tested on real
eating and drinking activities recorded with three-dimensional accelerometers. Experimental results
show that the EKF- and HTM-based two-stage approach detects the activities successfully with very
high accuracy.

Keywords: Wireless Sensor, HTM, Feature Extraction, Eating and Drinking, Euler Angle.

1. Introduction

Tracking and identifying daily physical activities is key to evaluating a person's quality of life
and health status. Research in this field is well recognized in rehabilitation and the assessment of
physical treatment [1,2], and has been shown to have a significant impact on the health care of
elderly persons and patients [3]. For example, Great Eastern Life Insurance Company defines
disability in elderly people as the inability of the policyholder to perform at least 3 Activities
of Daily Living (washing, dressing, feeding, toileting, mobility and transferring), even with the
aid of special equipment, always requiring the physical assistance of another person throughout the
entire activity. Among these activities, feeding means the ability to feed oneself food after it
has been prepared and made available. Eating and drinking detection is therefore an important topic
for daily life surveillance. Measuring eating or drinking activities in daily life, or continuously
recording them at home, would allow hospitals or insurance companies to diagnose disabilities more
reliably. However, eating and drinking detection remains a challenge for the state of the art in
activity recognition [4], and few references or systematic methods can be found in the literature.

In a daily life surveillance system, accurately tracking human activities such as eating or
drinking greatly improves the identification ability of the whole system. Devices that can
accurately track the pose of limbs in space are therefore essential components of such a
surveillance system.

One way to track and monitor activities is to track the pose of human limbs in space. Human limb
tracking systems can be classified as non-vision-based or vision-based. Non-vision-based systems
use inertial, mechanical, magnetic and other sensors to collect movement signals continuously. For
example, Micro-ElectroMechanical Systems (MEMS) inertial and magnetic sensor devices [5, 6, 7, 8]
can be used in most circumstances without limitations (e.g., illumination, temperature, or space)
and are more accurate than mechanical sensors. The main drawback of inertial sensors is that
accumulated errors (drift) can become significant after a short period of time. Vision-based
systems, such as [9, 10, 11, 12], have been widely used in recent years. However, most vision-based
approaches to human movement tracking involve intensive computation, such as temporal differencing,
background subtraction or occlusion handling. In many cases, once a priori knowledge of the object
kinematics is available, an expensive image detector array appears inefficient and unnecessary.

Accelerometry-based activity analysis has developed rapidly in recent years. Prototype systems have
been reported that monitor daily activities [13], conduct gait analysis [14], and so on. In our
system, 3D accelerometers collect raw measurement data from the moving arm, and the server computer
communicates with the sensor devices via Bluetooth. The simple hardware structure makes data
acquisition and processing easy. In this paper, a combined two-stage recognition approach is
proposed for eating and drinking detection in daily life surveillance. A kinematic model of human
forearm movement in three dimensions is developed, and the Extended Kalman Filter (EKF) is applied
to extract features from the 3D accelerometer signals (raw data). This greatly improves the
recognition results compared to feeding the raw data directly into the Hierarchical Temporal Memory
(HTM) network. After feature extraction, the HTM algorithm is applied for recognition. Owing to its
hierarchical memory and belief propagation mechanisms, HTM can classify dynamic signals that vary
in both time and space.

To the best of our knowledge, no existing work addresses eating and drinking activity detection
based on feature extraction algorithms. Our main contribution is the two-stage approach and the
feature extraction applied to eating/drinking detection. This method not only improves the accuracy
of activity detection compared to using the raw data, but also provides a basis for identifying
time- and space-varying activities with the HTM algorithm.

The paper is organized as follows. Section 2 presents related work on arm gesture classification.
Section 3 describes the system hardware and the wireless accelerometer used in this paper. Section 4
presents the feature extraction algorithm we derived. Section 5 describes how HTM works and
proposes our HTM network design for eating/drinking detection. Section 6 reports the simulation and
experimental results. Conclusions and future work are given in Section 7.

2. Related Work 

The following text describes relevant work that utilizes human model-based approaches involving 
hand and arm movements and gestures. The comparison between the HTM algorithm and the relevant 
work is also presented. 

The common methodologies used for arm gesture recognition are: (1) template matching [15];
(2) neural networks [15]; (3) statistical methods; and (4) multi-modal probabilistic combination
[16]. The template approach compares the unclassified input sequence with a set of predefined
template patterns. It requires preliminary work to generate the set of gesture patterns, and
typically has poor recognition performance because the input is difficult to align with the
template patterns [19].

By far the most popular recognition methods are neural networks (e.g., [17]) and the statistical
method of Hidden Markov Models (HMMs) (e.g., [18]).

The Neural Network (NN) approach works by pre-determining a set of common discriminating features,
estimating covariances during a training process, and using a discriminator to classify gestures.
The drawbacks of this method are that features are selected manually and time-consuming training is
involved [15]. Moreover, the NN does not exploit temporal coherence between the features as HTM
does.

An HMM is a variant of a finite state machine characterized by a set of states, a set of
observation symbols for each state, and probability distributions for state transitions,
observation symbols and initial states [20]. The state transitions, which are hidden from the
observer, generate an observation symbol from each state. The basic premise of HMMs is to infer the
state sequence that produces a sequence of observations. Learning the state sequence helps to
understand the structure of the underlying model that generates the observation sequence. The major
drawbacks of HMMs are: (1) they require a set of training gestures to generate the state transition
network and tune parameters; (2) they assume that successive observations are independent, which is
typically not the case for human motion and speech [20].

Among the statistical methods, Hierarchical Hidden Markov Models (HHMMs) [21] and Bayesian networks
[22] come closest to the way HTM models time, that is, modelling the nested structure of time in a
hierarchy. However, the hierarchy exploited in HHMMs exists in only one dimension (usually time),
whereas HTM has a hierarchy in both space and time. This gives HTM several unique advantages while
learning about the world. Moreover, the theory of HTM includes provisions for using activities and
attention to learn about the world.

Figure 1. The sensor in our experiment.

Support Vector Machine (SVM) [23, 24] is an efficient way to find boundaries in a high dimensional 
space that separate the various examples into their labelled categories. It does not make any assumptions 
about the hierarchical or temporal organization of the world and hence cannot exploit these properties 
for efficient learning. Since the underlying model of SVM is discriminative and not generative, it cannot 
be used to predict forward in time. 

HTM uses a unique combination of the following ideas [24]: 1) a hierarchy in space and time to
share and transfer learning; 2) slowness of time, which, combined with the hierarchy, enables
efficient learning of intermediate levels of the hierarchy; 3) learning of causes by using time
continuity and actions; 4) models of attention and specific memories; 5) a probabilistic model
specified in terms of relations between a hierarchy of causes; 6) belief propagation in the
hierarchy to use temporal and spatial context for inference.

From the above analysis, the HTM method has the following advantage: owing to its hierarchical
memory and belief propagation, it can classify dynamic signals that vary in both time and space.
Applied to the features extracted by the EKF, HTM can greatly improve the accuracy of activity
detection. Compared to the traditional methods mentioned above, it is a promising research tool for
activity detection and classification.

3. System Hardware 

Our system uses the Mobile Cardiac Monitor from Alive Technologies, shown in Figure 1. It is a
wireless health monitoring product for screening, diagnosis and management of chronic diseases, and
for consumer health and fitness. Applications include the management of atrial fibrillation and
heart failure, cardiac rehabilitation and fitness monitoring. Designed for use in the doctor's
office, home or gym, the monitor uses Bluetooth and mobile phone networks to immediately transmit
accelerometer data or other data such as heart rate to a computer, PDA, or central monitoring
center. Although it combines the functions of several sensors, only the 3-axis accelerometer inside
the monitor concerns us here. This device and one computer constitute our eating and drinking
detection system.

4. Euler Angle Tracking for Arm Movement

In this work, we attempt to recognize successive arm movements of the manoeuvring sub-phase to and
from the mouth using an inertial sensor, the accelerometer. To identify the arm movement, features
have to be extracted. Here, we use the Euler angles $\alpha$, $\beta$, $\gamma$ to describe
rotations or relative orientations of the arm. The angles $\alpha$, $\beta$, $\gamma$ describe
successive rotations about the fixed $x$, $y$, $z$ axes [25]. We consider the arm moving in a 3-D
Cartesian coordinate system, formulated in the next section. The system states, which represent the
arm movement features, comprise the arm angular velocities and the arm Euler angles.

4.1. The Arm Movement Model

Consider a rigid human forearm moving in 3-D space. Figure 2 shows the kinematics of the human
lower arm, where the elbow is fixed at $O$, which is defined as the system origin, the
accelerometer is attached near the wrist, and $r$ is the distance between the center of the sensor
and $O$. The figure also shows the relationship between the reference coordinate system and the
sensor coordinate system: $X$-$Y$-$Z$ denotes the reference Cartesian coordinate system and
$X'$-$Y'$-$Z'$ is the sensor frame. The accelerometer readings are taken along the axes of the
sensor frame. We choose the table used for eating/drinking as the $X$-$Y$ plane of the reference
coordinate system and the elbow of the person who is eating as the origin, which fixes the $Z$ axis
automatically. In Figure 2, the dotted line $L$ is the intersection between the $X$-$Y$ plane of
the reference coordinate system and the $X'$-$Y'$ plane of the sensor coordinate system. The Euler
angles of the arm movement system are thus $\alpha$, $\beta$, $\gamma$, according to the definition
in [25].

In the sensor coordinate system, the accelerometer readings along the three axes are $a_1'$,
$a_2'$, $a_3'$, and gravity is $\mathbf{g}'$. Assume that the Euler angles at time step $k$ are
$\alpha(k)$, $\beta(k)$, $\gamma(k)$ in the reference coordinate system, the angular velocities are
$\dot{\alpha}(k)$, $\dot{\beta}(k)$, $\dot{\gamma}(k)$, and the sampling time interval is $T$. We
assume that the angular velocity is constant during each sampling interval, which is a reasonable
approximation if the sampling period is small. We can then write the system state equations as
follows:

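$$\mathbf{x}(k+1) = \mathbf{F}(k)\,\mathbf{x}(k) + \mathbf{v}(k), \tag{1}$$

where $\mathbf{x}(k+1) = [\alpha(k+1)\ \ \dot{\alpha}(k+1)\ \ \beta(k+1)\ \ \dot{\beta}(k+1)\ \ \gamma(k+1)\ \ \dot{\gamma}(k+1)]^T$,

$$\mathbf{F}(k) = \begin{bmatrix}
1 & T & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & T & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & T \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}, \qquad
\mathbf{v}(k) = [v_{\alpha}\ \ v_{\dot{\alpha}}\ \ v_{\beta}\ \ v_{\dot{\beta}}\ \ v_{\gamma}\ \ v_{\dot{\gamma}}]^T.$$

$v_{\alpha}$, $v_{\dot{\alpha}}$, $v_{\beta}$, $v_{\dot{\beta}}$, $v_{\gamma}$, $v_{\dot{\gamma}}$ are the noise terms of the
respective state variables. They are assumed to be independent, zero-mean Gaussian noise with
distribution $P(v_{\alpha}, v_{\dot{\alpha}}, v_{\beta}, v_{\dot{\beta}}, v_{\gamma}, v_{\dot{\gamma}}) \sim N(0, \mathbf{Q}(k))$,
where $Q_{\alpha}$, $Q_{\dot{\alpha}}$, $Q_{\beta}$, $Q_{\dot{\beta}}$, $Q_{\gamma}$, $Q_{\dot{\gamma}}$ are the variances of the
respective variables.

Figure 2. The 3-D arm movement system.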



In order to build the estimation scheme, a sensor observation model is needed. In the sensor
Cartesian coordinate system, the sum of the accelerations should be zero, assuming that the arm
moves with constant velocity during each sampling period $kT < t < (k+1)T$. The assumption of zero
acceleration is reasonable if the sampling time is very small, because the velocity of the arm is
almost unchanged during such a small period. Thus, we have:

$$\mathbf{g}'(k) = \sum_i \mathbf{a}'_i(k) = \begin{bmatrix} a'_1(k) \\ a'_2(k) \\ a'_3(k) \end{bmatrix}, \tag{2}$$

where $\mathbf{g}'(k)$ and $\mathbf{a}'_i(k)$ are vectors and $a'_i(k)$ is the reading from the
accelerometer at time step $k$. According to the coordinate transformation relationship in [25], we
have:

$$\mathbf{g}'(k) = \mathbf{R}_{xyz} \begin{bmatrix} 0 \\ 0 \\ -g \end{bmatrix}, \tag{3}$$

where $\mathbf{R}_{xyz}$ is the transformation matrix from the fixed reference frame to the sensor
frame, and $g$ is the gravity in the reference coordinate system. According to [25], we know:

$$\mathbf{R}_{xyz} = \begin{bmatrix}
R_1 & R_2 & R_5 \\
R_3 & R_4 & R_6 \\
R_7 & R_8 & \cos\beta(k)
\end{bmatrix},$$

where

$$\begin{aligned}
R_1 &= \cos\alpha(k)\cos\beta(k)\cos\gamma(k) - \sin\alpha(k)\sin\gamma(k), \\
R_2 &= \sin\alpha(k)\cos\beta(k)\cos\gamma(k) + \cos\alpha(k)\sin\gamma(k), \\
R_3 &= -\cos\alpha(k)\cos\beta(k)\sin\gamma(k) - \sin\alpha(k)\cos\gamma(k), \\
R_4 &= -\sin\alpha(k)\cos\beta(k)\sin\gamma(k) + \cos\alpha(k)\cos\gamma(k), \\
R_5 &= -\sin\beta(k)\cos\gamma(k), \\
R_6 &= \sin\beta(k)\sin\gamma(k), \\
R_7 &= \cos\alpha(k)\sin\beta(k), \\
R_8 &= \sin\alpha(k)\sin\beta(k).
\end{aligned}$$
From equations (2) and (3), we can write the measurement model as:

$$\begin{bmatrix} a'_1(k) \\ a'_2(k) \\ a'_3(k) \end{bmatrix} = \mathbf{R}_{xyz} \begin{bmatrix} 0 \\ 0 \\ -g \end{bmatrix} + \mathbf{w}(k), \tag{4}$$

i.e.,

$$\begin{aligned}
a'_1(k) &= \sin\beta(k)\cos\gamma(k) \cdot g + w_1(k), \\
a'_2(k) &= -\sin\beta(k)\sin\gamma(k) \cdot g + w_2(k), \\
a'_3(k) &= -\cos\beta(k) \cdot g + w_3(k).
\end{aligned} \tag{5}$$

Or, in matrix form,

$$\mathbf{Z}(k) = \mathbf{h}(\mathbf{x}(k)) + \mathbf{w}(k), \tag{6}$$

where $\mathbf{h}$ is a non-linear measurement function that depends on the sensor's measurement
characteristics, and $\mathbf{w}(k)$ represents the measurement noise in the sensor. It is assumed
to be zero-mean Gaussian white noise. The covariance of $\mathbf{w}(k)$ is $\mathbf{R}(k)$.

The measurement model is non-linear, so it must be linearized for estimation. We therefore
calculate the Jacobian matrix as follows:

$$\mathbf{H}(\mathbf{x}(k)) = \begin{bmatrix}
\cos\beta\cos\gamma \cdot g & 0 & -\sin\beta\sin\gamma \cdot g & 0 \\
-\cos\beta\sin\gamma \cdot g & 0 & -\sin\beta\cos\gamma \cdot g & 0 \\
\sin\beta \cdot g & 0 & 0 & 0
\end{bmatrix}. \tag{7}$$
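The analytic Jacobian can be checked against finite differences of the measurement function. The following is a minimal sketch, assuming the reduced state ordering $[\beta, \dot{\beta}, \gamma, \dot{\gamma}]$ implied by the four columns of the Jacobian, a gravity value of 9.81 m/s^2, and an arbitrary test state; the function names are illustrative and not taken from the original work.

```python
import numpy as np

G = 9.81  # assumed gravity magnitude in m/s^2


def h(x, g=G):
    """Noise-free accelerometer reading for x = [beta, beta_dot, gamma, gamma_dot]."""
    beta, gamma = x[0], x[2]
    return g * np.array([
        np.sin(beta) * np.cos(gamma),
        -np.sin(beta) * np.sin(gamma),
        -np.cos(beta),
    ])


def jacobian(x, g=G):
    """Analytic Jacobian of h; the rate columns are zero."""
    sb, cb = np.sin(x[0]), np.cos(x[0])
    sg, cg = np.sin(x[2]), np.cos(x[2])
    return g * np.array([
        [cb * cg, 0.0, -sb * sg, 0.0],
        [-cb * sg, 0.0, -sb * cg, 0.0],
        [sb, 0.0, 0.0, 0.0],
    ])


def numerical_jacobian(x, eps=1e-6):
    """Central finite-difference approximation of the Jacobian of h."""
    x = np.asarray(x, dtype=float)
    columns = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        columns.append((h(x + step) - h(x - step)) / (2.0 * eps))
    return np.column_stack(columns)


x0 = np.array([0.4, 0.1, -0.7, 0.2])  # arbitrary test state (radians, rad/s)
max_err = np.max(np.abs(jacobian(x0) - numerical_jacobian(x0)))
rest_reading = h(np.zeros(4))  # beta = 0: only gravity along the third axis
```

With $\beta = 0$ the sketch returns a reading along the third axis only, which agrees with the stationary reading $(0, 0, -g)$ reported for the calibrated sensor.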

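From (6), it can be seen that $\alpha$ does not affect the accelerometer's readings. Ignoring the
first two rows of (1) ($\alpha$ and $\dot{\alpha}$ are not observable), we obtain the following
system model for estimating the arm movement:

$$\begin{bmatrix} \beta(k+1) \\ \dot{\beta}(k+1) \\ \gamma(k+1) \\ \dot{\gamma}(k+1) \end{bmatrix} =
\begin{bmatrix} 1 & T & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & T \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \beta(k) \\ \dot{\beta}(k) \\ \gamma(k) \\ \dot{\gamma}(k) \end{bmatrix} +
\begin{bmatrix} v_{\beta} \\ v_{\dot{\beta}} \\ v_{\gamma} \\ v_{\dot{\gamma}} \end{bmatrix}. \tag{8}$$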



Based on the above system model and observation model, the EKF can be used to estimate the system 
variables. 



4.3. Extended Kalman Filter for Estimation 

Based on the above system model, an extended Kalman filter [26] is used to implement the state
prediction and update.

Assume the system equation is as in Equation (1). Given the estimate $\hat{\mathbf{x}}(k \mid k)$
of $\mathbf{x}(k)$, the predicted state $\hat{\mathbf{x}}(k+1 \mid k)$ using (1) is given by

$$\hat{\mathbf{x}}(k+1 \mid k) = \mathbf{F}(k)\,\hat{\mathbf{x}}(k \mid k). \tag{9}$$

The prediction error covariance matrix is approximated by:

$$\mathbf{P}(k+1 \mid k) = \mathbf{F}(k)\,\mathbf{P}(k \mid k)\,\mathbf{F}^T(k) + \mathbf{Q}(k). \tag{10}$$

In view of the system observation model:

$$\mathbf{Z}(k) = \mathbf{h}(\mathbf{x}(k)) + \mathbf{w}(k), \tag{11}$$

the predicted measurement is simply:

$$\hat{\mathbf{Z}}(k+1) = \mathbf{H}(\mathbf{x}(k))\,\hat{\mathbf{x}}(k+1 \mid k). \tag{12}$$

Then, the difference between the measurement and the predicted observation, named the innovation,
is given by:

$$\begin{aligned}
\boldsymbol{\nu}(k+1) &= \mathbf{Z}(k+1) - \hat{\mathbf{Z}}(k+1) && (13) \\
&= \mathbf{h}(\mathbf{x}(k)) - \mathbf{H}(\mathbf{x}(k))\,\hat{\mathbf{x}}(k+1 \mid k) + \mathbf{w}(k). && (14)
\end{aligned}$$

Thus, the covariance of the innovation is:

$$\mathbf{s}(k+1) = \mathbf{H}(\mathbf{x}(k))\,\mathbf{P}(k+1 \mid k)\,\mathbf{H}(\mathbf{x}(k))^T + \mathbf{R}(k+1). \tag{15}$$

The EKF gain is given by:

$$\mathbf{K}(k+1) = \mathbf{P}(k+1 \mid k)\,\mathbf{H}(\mathbf{x}(k))^T\,\mathbf{s}^{-1}(k+1). \tag{16}$$

We update the estimation using the following equations:

$$\hat{\mathbf{x}}(k+1 \mid k+1) = \hat{\mathbf{x}}(k+1 \mid k) + \mathbf{K}(k+1)\,\boldsymbol{\nu}(k+1), \tag{17}$$

$$\mathbf{P}(k+1 \mid k+1) = \mathbf{P}(k+1 \mid k) - \mathbf{K}(k+1)\,\mathbf{s}(k+1)\,\mathbf{K}^T(k+1). \tag{18}$$

Figure 3. A simple HTM network structure (the first layer is the sensory data input).
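The prediction and update cycle can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the implementation used in the experiments: it uses the reduced four-state model $[\beta, \dot{\beta}, \gamma, \dot{\gamma}]$, evaluates the measurement function and its Jacobian at the predicted state (the textbook EKF form), and runs on simulated data. The sampling interval, noise levels, initial values and function names are all hypothetical.

```python
import numpy as np

G = 9.81  # assumed gravity magnitude in m/s^2


def make_F(T):
    """Constant-angular-velocity transition for [beta, beta_dot, gamma, gamma_dot]."""
    return np.kron(np.eye(2), np.array([[1.0, T], [0.0, 1.0]]))


def h(x):
    """Noise-free accelerometer reading predicted from the state."""
    beta, gamma = x[0], x[2]
    return G * np.array([np.sin(beta) * np.cos(gamma),
                         -np.sin(beta) * np.sin(gamma),
                         -np.cos(beta)])


def jacobian(x):
    """Jacobian of h with respect to the four state variables."""
    sb, cb = np.sin(x[0]), np.cos(x[0])
    sg, cg = np.sin(x[2]), np.cos(x[2])
    return G * np.array([[cb * cg, 0.0, -sb * sg, 0.0],
                         [-cb * sg, 0.0, -sb * cg, 0.0],
                         [sb, 0.0, 0.0, 0.0]])


def ekf_step(x_est, P_est, z, F, Q, R):
    """One predict/update cycle; returns the updated state and covariance."""
    x_pred = F @ x_est                    # state prediction
    P_pred = F @ P_est @ F.T + Q          # prediction error covariance
    H = jacobian(x_pred)                  # linearize at the predicted state
    nu = z - h(x_pred)                    # innovation
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # filter gain
    x_new = x_pred + K @ nu               # state update
    P_new = P_pred - K @ S @ K.T          # covariance update
    return x_new, P_new


# Hypothetical simulation: a forearm rotating at constant angular rates.
T = 0.02
F = make_F(T)
Q = np.diag([1e-5, 1e-3, 1e-5, 1e-3])
R = 0.05 ** 2 * np.eye(3)
rng = np.random.default_rng(0)

x_true = np.array([0.5, 0.3, 0.2, -0.2])
x_est = np.array([0.4, 0.0, 0.1, 0.0])
P_est = 0.1 * np.eye(4)
for _ in range(200):
    x_true = F @ x_true
    z = h(x_true) + rng.normal(scale=0.05, size=3)
    x_est, P_est = ekf_step(x_est, P_est, z, F, Q, R)

angle_error = np.abs(x_est[[0, 2]] - x_true[[0, 2]])
```

The simulated trajectory keeps $\beta$ away from zero on purpose: when $\sin\beta = 0$ the readings no longer depend on $\gamma$, so $\gamma$ cannot be estimated from the accelerometer alone.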

5. Hierarchical Temporal Memory Algorithm 

5.1. The HTM Structure and How It Works

An HTM is structured as a hierarchy of nodes, where each node performs the same learning
algorithm. Figure 3 shows a simple HTM hierarchy. Measurement data from sensors (sensory data)
enter at the bottom. Exiting the top is a vector in which each element represents a potential cause
of the sensory data, that is, a possible object that gives rise to the sensory data. Each node in
the hierarchy performs the same function as the overall hierarchy: it looks at the spatial-temporal
pattern of its input and learns to assign causes to this input pattern. Put simply, each node, no
matter where it is in the hierarchy, discovers the causes of its inputs. The outputs of nodes at
one level become the inputs to the next level in the hierarchy. Nodes at the bottom of the
hierarchy receive input from a small area of the sensory input, so the causes they discover are
relevant to a small part of the sensory input area. Higher regions receive input from multiple
nodes below, and again discover the causes in this input. These causes are of intermediate
complexity, occurring over larger areas of the entire input space. The node or nodes at the top of
the hierarchy represent high-level causes that may appear anywhere in the entire sensory field. For
example, in a visual inference HTM, nodes at the bottom of the hierarchy typically discover simple
causes such as edges, lines, and corners in a small part of the visual space. Nodes at the top of
the hierarchy represent complex causes such as dogs, faces, and cars, which can appear over the
entire visual space or any sub-part of it. Nodes at intermediate levels in the hierarchy represent
causes of intermediate complexity that occur over intermediate-sized areas of the visual space.

Figure 4. Our design for eating or drinking application.

5.2. Learning Algorithm in One Node

An HTM node consists of the following components: 1) a spatial pooler, which finds meaningful
coincidences in its inputs; 2) a temporal pooler, which groups coincidences that occur nearby in
time; 3) a supervised mapper (for supervised learning), which associates coincidences with
categories received from a category sensor. The spatial pooler, temporal pooler, and supervised
mapper are the key substructures of the nodes that perform learning and inference [27].

During the learning mode, the spatial pooler analyzes the stream of input vectors in order to
generate a coincidence matrix. This coincidence matrix quantizes the potentially huge space of all
possible input vectors into a relatively small, finite set of representative canonical inputs. The
algorithm applied in this phase is Maxdistance: if the squared distance between an input vector $x$
and an existing coincidence $w$ is less than a defined Maxdistance, the input vector is not
considered novel and is (conceptually) pooled together with that existing coincidence; the details
of the algorithm can be found in [27]. The coincidence matrix starts out empty. When the spatial
pooler selects a particular input vector to be a coincidence, it simply appends this input vector
to the coincidence matrix as a new row. For example, suppose there are 6 vectors, $x_1$, $x_2$,
$x_3$, $x_4$, $x_5$ and $x_6$, and let $W$ denote the coincidence matrix. At initialization,
$W = [x_1]$. If the distance between the second incoming vector $x_2$ and $x_1$ is less than
Maxdistance, $W$ is unchanged; otherwise, $W$ becomes $[x_1; x_2]$. Once all 6 vectors have been
processed, $W$ is complete.
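The learning rule can be illustrated with a short sketch. It follows the description above (an input becomes a new row only if its squared distance to every stored coincidence is at least Maxdistance) and is not the reference implementation from [27]; the function name, the six two-dimensional vectors and the threshold value are made up for the example.

```python
import numpy as np


def learn_coincidences(inputs, max_distance):
    """Build a coincidence matrix by appending only novel input vectors."""
    coincidences = []
    for x in inputs:
        x = np.asarray(x, dtype=float)
        # Novel means: not within max_distance (squared) of any stored row.
        is_novel = all(np.sum((x - w) ** 2) >= max_distance for w in coincidences)
        if is_novel:
            coincidences.append(x)
    return np.vstack(coincidences)


# Six hypothetical input vectors x1..x6 and a hypothetical threshold.
vectors = [[0, 0], [1, 0], [5, 5], [5, 6], [10, 0], [0.5, 0.5]]
W = learn_coincidences(vectors, max_distance=4.0)
```

Here the first, third and fifth vectors are stored as coincidences, and each of the other three is pooled with a stored row that lies within the threshold.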

Once the node is switched to inference mode, the spatial pooler no longer updates the coincidence
matrix, and instead compares each new input vector to the coincidences in the coincidence matrix.
During inference, the spatial pooler computes a belief vector $y$ for its input vector $x$. This
output vector is a distribution over coincidences, so it contains one element for each row in the
coincidence matrix. The spatial pooler computes the belief for the $j$th coincidence $w_j$
according to the equation $y_j = e^{-\|x - w_j\|^2 / 2\sigma^2}$. For example [27], suppose that
during learning mode the pooler generated a coincidence matrix $W$ containing the following three
coincidences:

$$W = \begin{bmatrix}
1 & 2 & 2 & 6 & 1 & 4 & 5 & 8 \\
9 & 8 & 9 & 1 & 0 & 2 & 1 & 4 \\
5 & 5 & 4 & 6 & 5 & 4 & 6 & 6
\end{bmatrix}.$$

Assume that the pooler receives the following input vector during inference:

$$x_1 = [3\ \ 3\ \ 4\ \ 5\ \ 2\ \ 4\ \ 6\ \ 9].$$

When the input vector $x_1$ is presented, the pooler computes the squared distances to each of the
three coincidence vectors stored in $W$ as 13, 160, and 27, respectively. The pooler converts these
squared distances into belief values using the Gaussian model:

$$e^{-13/2\sigma^2} = 0.771, \qquad e^{-160/2\sigma^2} = 0.041, \qquad e^{-27/2\sigma^2} = 0.583.$$

Here we assume that the node has been configured with $\sigma$ equal to 5.0, the square root of
Maxdistance. These three belief values are assembled into the spatial pooler output belief vector
$y_1$:

$$y_1 = [0.771\ \ 0.041\ \ 0.583].$$

The detailed algorithm (Gaussian Inference/Dot Inference) and equations can be found in [27]. In
this context, the term belief represents a generalized measure of the likelihood that a particular
input vector $x$ and a particular coincidence $w$ both represent the same underlying real-world
cause.

The output vector $y$ is handed off to the temporal pooler. In fact, the spatial pooler can be
thought of as a pre-processor for the temporal pooler: it simplifies the temporal pooler's task by
pooling the vast space of input vectors into a relatively small set of discrete coincidences that
are easier to handle. The coincidence matrix and the corresponding output vector are the input of
the temporal pooler. The job of the temporal pooler is to group together temporally related
coincidences. During learning, the temporal pooler receives coincidence indices sent by the spatial
pooler and keeps track of which ones occurred close together in time. To do so, it builds the
time-adjacency matrix, which keeps track of transitions between coincidences: for each past
coincidence, the pooler increments the value in the time-adjacency matrix corresponding to a
transition from the past coincidence to the current coincidence. The time-adjacency matrix thus
records how many times each transition occurred during the learning process. After learning is
completed, the pooler forms non-overlapping groups of coincidences, with each group containing
coincidences that often followed each other during learning. During inference, the temporal pooler
builds a list of groups from the time-adjacency matrix. It also creates a matrix of weights, using
the coincidence frequency counts maintained by the spatial pooler. The temporal pooler uses its
list of groups to convert incoming belief vectors to distributions over groups. The algorithm
(maxProp) for this phase can be found in [27]. For each group, the maxProp algorithm finds the
coincidence in that group with the highest value in the belief vector received from the spatial
pooler. That maximal belief in the group becomes the value for the group itself, and it is entered
into the output vector. To illustrate [27], let the input $y$ from the spatial pooler be as
follows, representing beliefs over five coincidences:

$$y = [0.04\ \ 0.12\ \ 0.30\ \ 0.01\ \ 0.22].$$

Let the groups be: group 0 contains coincidences 1, 3, and 4, and group 1 contains coincidences 0
and 2. The output $z$ of the temporal pooler is then:

$$z = [0.22\ \ 0.30].$$

Here, $z$ contains the highest value in $y$ for all coincidences in group 0 (0.22, from coincidence
4), as well as the highest value in $y$ for all coincidences in group 1 (0.30, from coincidence 2).
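Both worked examples of this section can be reproduced with a short sketch: Gaussian inference in the spatial pooler followed by maxProp in the temporal pooler. The numbers are those quoted from [27]; the function names `gaussian_beliefs` and `max_prop` are illustrative and are not part of any HTM library API.

```python
import numpy as np


def gaussian_beliefs(x, W, sigma):
    """Belief of input x for each row of W: exp(-squared distance / (2 sigma^2))."""
    sq_dist = np.sum((W - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def max_prop(y, groups):
    """For each group, keep the highest belief among its member coincidences."""
    return np.array([max(y[i] for i in members) for members in groups])


# Spatial pooler example: three stored coincidences and one input vector.
W = np.array([[1, 2, 2, 6, 1, 4, 5, 8],
              [9, 8, 9, 1, 0, 2, 1, 4],
              [5, 5, 4, 6, 5, 4, 6, 6]], dtype=float)
x1 = np.array([3, 3, 4, 5, 2, 4, 6, 9], dtype=float)
sq_distances = np.sum((W - x1) ** 2, axis=1)   # 13, 160, 27
y1 = gaussian_beliefs(x1, W, sigma=5.0)        # approx. [0.771, 0.041, 0.583]

# Temporal pooler example: beliefs over five coincidences and two groups.
y = np.array([0.04, 0.12, 0.30, 0.01, 0.22])
groups = [[1, 3, 4], [0, 2]]
z = max_prop(y, groups)                        # [0.22, 0.30]
```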

The top-level node(s) use the supervised mapper instead of the temporal pooler, and so do not have
any groups. The job of the mapper is simply to map coincidences from the spatial pooler to
categories obtained from the category sensor file. The mapper assumes that the lower levels have
sufficiently discriminated the different categories to create a clean mapping between coincidences
and output categories.

5.3. Design of HTM for Eating/Drinking Detection 

In our application, the eating/drinking data from the accelerometer are quite different from the
data in image processing problems, the typical application of the HTM algorithm. We therefore have
to design the HTM framework and define how to use HTM for our experimental data set.

The HTM applied for eating/drinking identification is designed as a 4-layer structure (Figure 4).
The sensor data input layer has 64 input nodes, and the sensor data length for one activity is 300.
The second layer has 16 nodes and the third layer has 8 nodes. The top layer is the classification
result layer, with one node only.

The input to the HTM is a buffer 256 data sets long, with each set consisting of 3 values (the
accelerations in the x, y and z directions); the length of the buffer corresponds to time. One such
buffer represents a single eating or drinking activity, such as bringing a piece of broccoli on a
fork to your mouth and putting the fork back down. Table 1 shows the data buffer and the contents
of each line. We then construct 64 level 1 nodes, each reading a 3x4 patch of the values (256/64 = 4
data sets for each node and 3 values $x_i$, $y_i$, $z_i$ for each data set). Level 2 combines 4
level 1 nodes, level 3 combines 2 level 2 nodes, and level 4 is one node trained in supervised
mode.
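The way one buffer is split among the level 1 nodes, and the resulting node counts per level, can be sketched as follows. The buffer contents are placeholder values and the variable names are illustrative; only the sizes (256 data sets, 3 values each, fan-in of 4 and then 2) come from the description above.

```python
import numpy as np

N_SAMPLES, N_AXES, N_LEVEL1 = 256, 3, 64

# Placeholder buffer: 256 data sets, each holding (x, y, z) accelerations.
buffer = np.arange(N_SAMPLES * N_AXES).reshape(N_SAMPLES, N_AXES)

# Each level 1 node reads 256 / 64 = 4 consecutive data sets of 3 values.
patches = buffer.reshape(N_LEVEL1, N_SAMPLES // N_LEVEL1, N_AXES)

# Node counts per level: level 2 merges 4 level 1 nodes, level 3 merges
# 2 level 2 nodes, and the top level is a single supervised node.
level_sizes = [N_LEVEL1, N_LEVEL1 // 4, N_LEVEL1 // 4 // 2, 1]
```

Node 0 receives data sets 0 to 3, node 1 receives data sets 4 to 7, and so on, so neighbouring level 1 nodes see neighbouring time intervals of the activity.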

During the training phase, we scroll the sensor input data from left to right through the data
buffer (see Table 1), inserting a separating line of all zeros when a single activity is done; that
is, line 62 in Table 1 is a line with all elements zero. During inference, after the network has
been fully trained, we scroll the buffer in a similar manner and sum the solutions for better
accuracy.

To explain clearly how the data buffer is used during the training phase (see Table 1), consider
one activity with a total of 316 samples. We write 256 of these samples, with 3 numbers for each
sample (x, y, z acceleration), into one line of a text file, so one line holds 768 numbers in
total. Because one eating/drinking activity has 316 x, y, z samples (sometimes more or fewer), we
need to repeat this line 61 times (316 - 256 + 1 = 61), each time removing one sample (3 numbers)
at the left and adding the next one at the right. The 62nd line also has 768 numbers, all zeros, to
separate the activity from the next eating sample. This forms the sensor data file built from the
data buffer.

Table 1. Data Buffer.
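The scrolling buffer amounts to a sliding window over the samples of one activity. The sketch below assumes the activity is available as an array with one row per sample; the function name and the random placeholder data are illustrative only.

```python
import numpy as np


def build_buffer_lines(samples, window=256):
    """Slide a window over one activity and append an all-zero separator line."""
    samples = np.asarray(samples, dtype=float)
    n_samples, n_values = samples.shape
    if n_samples < window:
        raise ValueError("activity is shorter than the buffer width")
    # One line per window position, flattened as x0, y0, z0, x1, y1, z1, ...
    lines = [samples[start:start + window].reshape(-1)
             for start in range(n_samples - window + 1)]
    lines.append(np.zeros(window * n_values))  # separator between activities
    return np.vstack(lines)


# Placeholder activity with 316 samples of (x, y, z) acceleration.
activity = np.random.default_rng(0).normal(size=(316, 3))
lines = build_buffer_lines(activity)
```

For 316 samples and a width of 256 this yields 61 data lines plus the separator, each holding 768 numbers.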

A corresponding category file for the sensor data file should be set up for training. For example,
the category file for an eating activity repeated 61 times has 61 lines, each containing one
number, e.g. 1 meaning eating, and the line of zeros (blanks) has 0 as its category. The next
activity might be, for example, "drinking"; it again has 61 lines as explained above, with the
corresponding lines in the category file containing the number 2 meaning "drinking", again
separated by 0 from the next activity.
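The matching category sequence can be generated alongside the sensor data lines. In this sketch each activity is given as a pair of window count and label; the function name is hypothetical, and the labels 1 (eating), 2 (drinking) and 0 (separator) follow the convention described above.

```python
def build_categories(activities):
    """Return one category per sensor data line, with 0 after each activity."""
    categories = []
    for n_lines, label in activities:
        categories.extend([label] * n_lines)  # one label per window position
        categories.append(0)                  # label of the all-zero line
    return categories


# One eating activity followed by one drinking activity, 61 windows each.
categories = build_categories([(61, 1), (61, 2)])
```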

We give the network many such activities to train on: in our experiment, at least a dozen of each,
and preferably hundreds (although this is time-consuming).

During the testing phase, we use test data that are totally different from the training data (a
different person's activities), repeating the above steps to build the data buffer. The HTM outputs
very accurate results not only for data that partially belong to the training data but also for
data that are totally different from the training data.

Note that the width of the data buffer for the training data influences the classification
results. For repetitive or continuous eating/drinking, the buffer width should cover at least one
period of the activity's data samples. For example, if the repetitive or continuous eating has 1000
measurement data, we can choose 256 or 516 data for each line of the data buffer; however, if 256
data is less than the data in one single eating activity, we have to choose 516 data in order to
get better training results.

6. Experimental Results 

The experiments are conducted with the system introduced in Section 3, which includes a three-axis 
accelerometer. The accelerometer is attached to both wrists and eating/drinking activities are performed; 
see Figure 5 and Figure 6. Both single and continuous eating/drinking activities are tested in the 
experiments. The proposed two-stage eating/drinking detection approach is applied to the experimental 
data. Before the experiments, the sensor is calibrated. From the calibration, we confirm that the readings 
of the accelerometer include the acceleration due to gravity, i.e., when the wrist does not move, the 
sensor reading is (0, 0, -g). 
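The stationary check described above can be sketched as follows. This is a hedged illustration rather 
than the calibration procedure used in the experiments: the value of `G`, the tolerance `tol`, the 
function names and the sample readings are all assumptions. 

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float, float]
G = 9.81  # assumed gravitational acceleration in m/s^2

def mean_reading(samples: Sequence[Sample]) -> Sample:
    """Average the x, y and z readings over a stationary recording."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    return (
        sum(s[0] for s in samples) / n,
        sum(s[1] for s in samples) / n,
        sum(s[2] for s in samples) / n,
    )

def reads_gravity_at_rest(samples: Sequence[Sample], tol: float = 0.2) -> bool:
    """True if the mean stationary reading is close to (0, 0, -G)."""
    mx, my, mz = mean_reading(samples)
    return abs(mx) <= tol and abs(my) <= tol and abs(mz + G) <= tol

# Hypothetical stationary readings.
at_rest = [(0.01, -0.02, -9.80), (-0.01, 0.02, -9.82)]
ok = reads_gravity_at_rest(at_rest)
```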

Figure 5. The experiment with the 3-axis accelerometer. 

Figure 6. The eating detection experiment. 

The first experiment is for eating activity detection. Using a real plate, food and forks, we performed 
a single eating activity and a continuous eating activity, respectively. The raw sensor data for the 
continuous eating activity is shown in Figure 7, which gives the three-axis acceleration values at each 
time instant. For the single eating activity, the data is one period of the continuous eating data. The 
second experiment is for the drinking activity. Several drinking activities are conducted. The raw 
drinking sensor data is shown in Figure 9. 

From these figures, we see that the raw sensor data is very noisy. In order to detect and distinguish 
the eating and drinking activities more robustly and effectively, the feature extraction algorithm 
proposed in Section 4 is applied first. The feature extraction results are shown in Figures 8 and 10, 
respectively. The features extracted in these figures are Euler angles and their angular rates: 
$\beta$, $\dot{\beta}$ and $\gamma$, $\dot{\gamma}$. Although the angle $\alpha$ and its rate 
$\dot{\alpha}$ are not used here because we apply one sensor only (the information is not enough; 
for the reason, see equation (8)), the four features in the figures are enough for classifying 
eating/drinking, as verified by the following experimental results from the HTM. 

Figure 7. The raw sensor data from the 3-axis accelerometer for the eating activity (plot title: "The 
accelerometer's readings"). 

Figure 8. The features extracted from the 3-axis accelerometer for the eating activity (plot title: "The 
parameters extracted from 3D eating data"). 

Figure 9. The raw sensor data from the 3-axis accelerometer for the drinking activity (plot title: "The 
accelerometer's readings for drinking"). 

Figure 10. The features extracted from the 3-axis accelerometer for the drinking activity (plot title: 
"The parameters extracted from 3D drinking data"). 

Table 2. Data Buffer of Features. 

After feature extraction, the HTM is applied to identify eating and drinking based on the features 
extracted from the raw sensor data. In order to train and test the HTM, the feature data is arranged 
into a data buffer in the same way as explained in Section 5.3. Note that the data buffer for feature 
data differs from the raw data buffer because each data set is 4 values wide rather than 3; see Table 2. 
We used the extracted-feature data buffer of a single eating and drinking activity to train the HTM by 
repeating the data file 10 times. Then the feature data of different single eating and drinking 
activities is classified by the trained HTM. Twenty Monte Carlo runs are performed. The results are 
always 100% accurate for both eating and drinking detection. 
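The 4-wide feature buffer and the 10-fold repetition of the training file can be sketched as below. This 
is an illustration only: the names `feature_buffer` and `repeat_buffer`, the generic argument names for 
the two Euler angles and their rates, and the sample values are assumptions, and the sketch does not 
call any HTM library. 

```python
from typing import List, Sequence, Tuple

FeatureRow = Tuple[float, float, float, float]

def feature_buffer(
    angle_a: Sequence[float],
    rate_a: Sequence[float],
    angle_b: Sequence[float],
    rate_b: Sequence[float],
) -> List[FeatureRow]:
    """Zip two Euler angles and their angular rates into 4-wide rows."""
    lengths = {len(angle_a), len(rate_a), len(angle_b), len(rate_b)}
    if len(lengths) != 1:
        raise ValueError("all four feature series must have the same length")
    return list(zip(angle_a, rate_a, angle_b, rate_b))

def repeat_buffer(rows: Sequence[FeatureRow], times: int = 10) -> List[FeatureRow]:
    """Concatenate the same training rows `times` times."""
    return list(rows) * times

# Hypothetical feature values for two time instants.
rows = feature_buffer([0.1, 0.2], [1.0, 1.1], [0.3, 0.4], [2.0, 2.1])
training_rows = repeat_buffer(rows, times=10)
```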

The other experimental test is for continuous eating and drinking activity detection with the same 
trained HTM network as in the above test. We also conducted 20 Monte Carlo runs on the classification 
of each continuous activity and calculated the average success rate. Table 3 gives the results for 10 
different groups of continuous activities obtained from different people at different times. 
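The averaging over Monte Carlo runs can be sketched as follows. This is a minimal sketch under stated 
assumptions: `run_once` is a hypothetical stand-in for one classification run that returns a success 
rate in [0, 1], and the source of randomness in each run is not specified in the text, so a seeded 
generator is passed in purely for illustration. 

```python
import random
from statistics import mean
from typing import Callable, Sequence

def success_rate(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of lines whose predicted category matches the true category."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

def average_success_rate(
    run_once: Callable[[random.Random], float],
    runs: int = 20,
    seed: int = 0,
) -> float:
    """Average the success rate returned by `run_once` over `runs` repetitions."""
    rng = random.Random(seed)
    return mean(run_once(rng) for _ in range(runs))

# Hypothetical labels: 1 = eating, 2 = drinking, 0 = separator.
example_rate = success_rate([1, 2, 1, 0], [1, 2, 2, 0])
```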

For comparison, we also use the HTM network alone to detect the eating and drinking activities (using 
the raw sensor data as the input of the HTM, without the first stage, feature extraction). In this case, 
the raw sensor data for a single activity is first arranged into the data buffer to train the HTM 
network. Then the raw sensor data of the continuous activities is classified by the trained HTM. The 
experimental results are listed in Table 4. 

From the two tables, we find that the average success rate is greatly improved when the feature 
extraction algorithm is applied, compared with using the raw sensor data as the input to the HTM. 

Table 3. The success rate of the eating/drinking detection by the HTM algorithm based on 
raw sensor data. 

Table 4. The success rate of the eating/drinking detection by the HTM algorithm based on 
features. 

7. Conclusions 

This paper presented a novel algorithm based on EKF and HTM for eating/drinking detection in human 
activity monitoring of daily life in wireless environments. The proposed method uses a simple hardware 
structure with wireless accelerometers, so the system is easy to set up. If a smaller and less expensive 
accelerometer were used, such as imote2 [28], the system could be more ambulatory and more affordable. 
For the proposed algorithm itself, the experimental results show that the new scheme can achieve 
accurate classification even for very noisy data. However, many issues remain for future study. 
Real-time algorithms that deal with both time-varying and space-varying signals, and multi-modality 
sensor based algorithms, are both challenging problems for further investigation. For example, if a 
person uses the same movement for eating and drinking, the results will be wrong because the HTM will 
classify both as the same activity. How to tackle this false detection problem and improve the success 
rate of the activity detection is a key problem. Multi-modality sensors should be used in the future to 
obtain more information on the eating/drinking activities and thereby improve the success rate of the 
activity detection. 